ZFS is a combined file system and volume manager that actively protects your data. It replaces lots of software you may be using such as LVM, RAID, and backup applications.
This is a set of notes I use to remember how to manage my ZFS configuration. You can find dozens of other ZFS tutorials and manuals. Most of them are better. Mine is shorter because it's far from comprehensive.
RAID levels are used to describe how an array of devices can be managed as one big storage resource. To understand ZFS, we only need a few:
A stripe has no redundancy. Storage for each file will be spread out across all the devices. If any device fails, the entire file system on the array will be compromised. Stripes are used to gain speed: All the devices work in parallel to transfer one file.
In a mirror with N devices, N-1 devices can fail and the data can still be recovered. All the devices are the same size, so the capacity of the array is the size of one device. The performance of a mirror is the same as that of one component device.
A parity array is more complex: The data is striped along with parity information. In RAID 5, data can be recovered if one device fails. In RAID 6, two devices may fail. Parity information takes up space: If all the devices in a RAID 5 array are the same size "X", one X will be required for the distributed parity information. For RAID 6, 2X will be consumed for parity. A parity array is slower than a mirror, but it provides more storage space with the same number of drives.
RAID 10 is used to get the redundancy of a mirror with more performance: It's created from an N-stripe of M-mirrors. By fussing with N and M, safety, size, and speed can be balanced for a given application.
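As a concrete illustration (the numbers are hypothetical): with four 4 TB drives, a stripe yields 16 TB with no redundancy, a 4-way mirror yields 4 TB, RAID 5 yields 12 TB (one drive's worth of parity), RAID 6 yields 8 TB, and RAID 10 built as a stripe of two 2-way mirrors also yields 8 TB.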
ZFS supports these organizations through its own constructs: stripes, mirrors, and parity arrays called raidz.
A raidz array can have up to 3 levels of parity. They are named raidz (or raidz1), raidz2, and raidz3.
Simple vdevs are disk drives, partitions, or files:
/dev/sdb
/dev/sdb1
/home/myself/myGreatBigFile
Complex vdevs are specified using an expression:
<type> device1 ... deviceN
The type is one of seven keywords taken from one of two groups:
Storage vdevs: mirror, raidz (raidz1), raidz2, raidz3
Optimization vdevs: log, cache, spare
Storage vdevs, as the name suggests, are where you get space for your files. Optimization vdevs are used to optimize pool performance or reliability. They are optional.
In some contexts, zfs commands require the use of simple vdevs. We will denote these as devices, reserving the term vdev for contexts where either a complex vdev or a simple device may be specified.
Pools are the top-level zfs construct for managing a collection of virtual devices. A pool is created by specifying a stripe of vdevs. Space for datasets is allocated dynamically from all the storage vdevs in the pool.
Each top-level vdev in a pool is allowed to be a different type and/or size, but this is seldom (if ever) a good idea. The most common redundant pool organizations are a single mirror, a single raidz array, a stripe of mirrors (the ZFS take on RAID 10), and a stripe of raidz arrays.
You cannot create mirrors of raid arrays or raid arrays of mirrors, etc. Only the types listed in the previous section are allowed (for now).
There are four types of dataset: filesystems, volumes, snapshots, and clones.
By default, when you create a new dataset, ZFS also creates a filesystem which manages space automatically. When you create a file in that filesystem, it is likely that bits of it are stored on every storage vdev in the pool. When you delete a file, the blocks are returned to the pool.
A volume, also called a zvol, is a large fixed-sized dataset formatted with any filesystem you prefer. Zvols are often used as disk drives for virtual machines. When a virtual machine is no longer needed, the volume can be destroyed, returning all the space to the pool.
A snapshot is conceptually a read-only copy of another parent dataset. Snapshots are implemented in such a way that they can be created nearly instantaneously and take very little space. A snapshot depends on the continued existence of the parent.
A clone is a snapshot you can modify. Clones grow as they are modified. Like snapshots, they depend on their parent.
A good policy to follow: Never create a clone you don't intend to destroy or promote within just a few days. If you actively use a clone, it will diverge until it consumes as much space as the parent, with the added disadvantage that you can never destroy the parent.
I mention this caveat because many beginners think it would be cool to have a pristine installation of Windows or Linux that gets cloned for use by virtual machines. It doesn't end well.
ZFS is a foreign kernel module that gets rebuilt automatically by dkms whenever your kernel is updated. Consequently, it depends on the kernel-devel package.
Install kernel-devel:
yum install kernel-devel
Install a link to the zfs package repository:
yum localinstall --nogpgcheck \
http://archive.zfsonlinux.org/fedora/zfs-release$(rpm \
-E %dist).noarch.rpm
Install zfs:
yum install zfs
THIS NEEDS REVISION. (There is a lot of misinformation on the web about this topic, perhaps because the behavior of ZFS-on-Linux has changed.)
At one time, the standard advice was to build pools from unpartitioned devices. There are many discussions on the web about the merits of this convention. But that advice is no longer relevant: ZFS now partitions unpartitioned devices automatically. I believe this is the reason:
Drives sold by different manufacturers with the same nominal size may have slightly different physical sizes. If you need to replace a device in an existing pool, the new device must not be smaller. If it is even one byte smaller, it cannot be used. By introducing a "spacer" partition, ZFS can adjust the size of the ZFS data partition so the new disk size will match the old ones.
To let ZFS have its way, it's best to create a pool with drives that have no partitions. When you create a pool, zfs will create two partitions:
/dev/sdx1 : The zfs data partition
/dev/sdx9 : The spacer partition
If you intend to boot a UEFI operating system on a ZFS device, you will have to create partitions by hand. This is covered in "Reliably boot Fedora with root on ZFS."
To clean up an old disk that might have been part of a zfs array, you should first use gdisk to list the partitions. Common partition types used for zfs are:
FreeBSD ZFS (0xa504)
TBD
If you see one of these types, exit gdisk and clear the ZFS label:
zpool labelclear -f /dev/sdxN
After that, use gdisk to remove all the partitions.
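Putting the cleanup together, a minimal sequence might look like this (the device /dev/sdx and its first partition are placeholders; sgdisk is an optional shortcut if you'd rather not delete the partitions interactively):
gdisk -l /dev/sdx               # list the partitions and check their types
zpool labelclear -f /dev/sdx1   # clear the ZFS label on the data partition
sgdisk --zap-all /dev/sdx       # wipe the partition table (or delete partitions in gdisk)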
ZFS wants a lot of memory. The recommended minimum is 1G per terabyte of storage in your pool(s). Without constraints, ZFS is likely to run off with all your memory and sell it at a pawn shop. To prevent this, you should specify a memory limit. This is done using a module parameter. A reasonable setting is half your total memory:
Edit or create:
/etc/modprobe.d/zfs.conf
Add this line:
options zfs zfs_arc_max=17179869184
The size is in bytes; powers of two are conventional:
64GB = 68719476736
32GB = 34359738368
16GB = 17179869184
8GB = 8589934592
4GB = 4294967296
2GB = 2147483648
1GB = 1073741824
512MB = 536870912
256MB = 268435456
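If you'd rather compute the value than pick it from the table, here is a small sketch that takes half of MemTotal from /proc/meminfo (assuming a 50% cap is what you want):
# MemTotal is reported in kB; multiply by 1024 for bytes, then halve it.
half=$(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) * 1024 / 2 ))
echo "options zfs zfs_arc_max=$half"   # append this line to /etc/modprobe.d/zfs.conf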
ZFS "wants" you to use ECC memory, which is typically only available on server class motherboards with Intel Xeon processors.
If you don't use ECC memory, you are taking a risk. Just how big a risk is the subject of considerable controversy and beyond the scope of these notes. Feelings run high on this topic:
"When you run a NAS without ECC and with software RAID, people tell legends of your idiocy which survive the ages. Statues will be built to your folly. Children centuries in the future will read of your pure, unbridled stupidity as something to herd their tender aspirations away from the path you once trod."
- Hat Monster
First, give yourself a fighting chance by testing your memory. Obtain a copy of this utility and run a 24 hour test:
http://www.memtest86.com
This is particularly important on a new server because defective memory often ships that way from the factory. If your memory passes the test, there is a good chance it will be reliable for some time.
They didn't use ECC memory...
You might be tempted to keep your server in a 1-meter thick lead vault buried 45 miles underground. It turns out that many of the radioactive sources for memory-damaging particles are already in the ceramics used to package integrated circuits. The cost and effort would likely be wasted.
Instead, we're going to enable the unsupported ZFS_DEBUG_MODIFY flag. This will mitigate, but not eliminate, the risk of using ordinary memory.
Edit:
/etc/modprobe.d/zfs.conf
Add the line:
options zfs zfs_flags=0x10
Reboot to be sure it "takes".
You can see the current value here:
cat /sys/module/zfs/parameters/zfs_flags
The rest of this guide describes only two commands: zpool and zfs.
The zpool command is used to create and configure pools.
The basic pattern:
zpool create (options) myPool vdev1 vdev2 ... vdevN
This expression creates a stripe. Blocks for datasets will be allocated across all the vdevs. If each vdev is a simple device, the pool will be vulnerable: If one device fails, the whole pool is lost. Consequently, the vdevs are usually mirrors or raidz arrays.
There are many options, but I want to mention one right now because it is irreversible if you get it wrong:
-o ashift=12
This specifies that your disk is Advanced Format, which is the same as saying it has 4096 byte sectors instead of the old 512 byte sectors. Most disks made after 2011 are advanced format so you'll need this option most of the time. If you forget, ZFS assumes the sector size is 512. If that's the wrong answer, you'll take a big performance hit. More details about this are covered later. I won't show this option in the examples, because it clutters up the logic. But don't forget!
You can add more vdevs to a pool anytime:
zpool add myPool vdev1 ... vdevN
You can only remove log, cache or spare vdevs:
zpool remove myPool aVdev
Blocks for datasets are allocated evenly across all the top-level storage vdevs. This is why you can't remove a top-level storage vdev.
Simple vdevs are specified using path notation. ZFS puts "/dev/" in front of a path name unless the path begins with "/".
You can specify devices like this:
sdb sdc sdd sde
Or using the full path:
/dev/sdb /dev/sdc /dev/sdd /dev/sde
File vdevs must have absolute paths:
/home/aFile /var/temp/goop
File vdevs are used mostly (always?) for experimentation.
zpool create myPool sdb sdc
zpool create myPool mirror sdb sdc
zpool create myPool raidz sdb sdc sdd sde
zpool create myPool mirror sdb sdc mirror sdd sde
zpool create myPool raidz sda sdb sdc raidz sdd sde sdf
This kind of pool is used to experiment without using disk drives. An example:
Create some empty files. (The minimum size of a file vdev is 64M.)
dd if=/dev/zero of=myFile1 bs=1M count=64
dd if=/dev/zero of=myFile2 bs=1M count=64
Create the pool:
zpool create myPool ~/myFile1 ~/myFile2
The paths must be absolute.
We start with a pool that has one device:
zpool create myPool sdb
Then we attach another, creating a mirror:
zpool attach myPool sdb sdc
We can add another, continuing from the previous example:
zpool attach myPool sdc sdd
You specify an existing device in the mirror followed by the new device to attach.
Continuing from the previous example, we remove the "end" device:
zpool detach myPool sdd
Now the mirror has only sdb and sdc.
Continuing from the previous example:
zpool detach myPool sdc
Now only sdb is left.
By default a new pool is mounted at the root of the file system where it appears as a directory named after the pool.
You can specify an alternative mount point for the pool when creating:
zpool create -m aPath myPool ...
In this expression, aPath is a regular file system path or the special keyword none. If none is specified, the pool will not be mounted. The last element of the path names the actual mountpoint, not just its parent directory, so you can use any name for the mountpoint. If an empty directory already exists at that location, zfs will mount the pool on that directory. Otherwise, a new mount point is created.
You can set or change the mountpoint of a pool anytime later:
zfs set mountpoint=aPath myPool
If the pool was already mounted, the old mount point will be removed.
Note that we are using the command zfs instead of zpool. That's because the pool itself is a dataset, which will be discussed in more detail later.
You can add one or more spares at any time:
zpool add myPool spare /dev/sdf
Spares can be used with any kind of pool.
A chain is only as strong as its weakest link. Suppose you have a raidz pool:
zpool create myPool raidz sdb sdc
Now you add a device:
zpool add myPool sdd
The raidz is spoiled because blocks for files are allocated from all the top-level devices, some from the raidz vdev and some from the device sdd.
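If the goal was to grow the pool without spoiling its redundancy, the safer move is to add another whole raidz vdev (device names here are placeholders):
zpool add myPool raidz sdd sde sdf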
zpool status
For the whole pool:
zpool iostat
Also individual vdevs:
zpool iostat -v
Continuous monitor every 5 seconds:
zpool iostat -v 5
zpool history myPool
zpool destroy myPool
After exporting a pool, you can remove the devices and install them in another computer:
zpool export myPool
Exporting a pool unmounts all the filesystems and "offlines" all the devices, so it is sometimes used in other situations when you want to stop all access to the pool.
On the destination machine, you resurrect the pool by importing:
zpool import myPool
After the import, all mountable datasets and filesystems will be mounted in their original locations. Sometimes this is inconvenient. You can arrange for all new mountpoints to be under a new path using the altroot property:
zpool import -o altroot=/my/new/path myPool
If you export a pool created using file vdevs, there is no place for them to store their parent directory. To import such a pool, you must specify the parent like this:
zpool import -d ~/zfsPlayroom myPool
First export the pool:
zpool export myPool
Then import it and specify a new name:
zpool import myPool myNewName
These can be specified when importing a pool using -o or by using the set command. The default value is indicated first when there are alternatives:
altroot=path
readonly=off | on
autoreplace=off | on
ashift=12
Example:
zpool set readonly=on myPool
These are read-only:
health
size
free
capacity
allocated
free + allocated = size
health = ONLINE | DEGRADED | FAULTED | OFFLINE | REMOVED | UNAVAIL
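You can read these with "zpool get", for example:
zpool get health,size,allocated,free,capacity myPool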
When you create a pool, it's easy to type and remember device names. These are the familiar "/dev/sdx" names. But there is a serious problem with device names: For scsi or sata disks, they are associated with the port where you plugged in the cable. And for USB disks and other removable devices, they are determined by the order they were plugged in. If you import a pool with the cables switched, disaster follows. And if you move the disks to another machine and try to import them, there will likely be device name conflicts.
To avoid all this, it's better to switch to names that are associated with the volume (the physical disk) rather than the connection. Linux provides several choices. They are listed in the /dev/disk directory. There are only two choices that are really useful: device IDs and UUIDs.
Device IDs are composed from the disk model number and serial number. When I have to replace a disk, I can be sure I got the right one because the serial number is printed on the paper label and "zpool status" will show the ID.
To make the switch to device ids:
zpool export myPool
zpool import -d /dev/disk/by-id myPool
You can also use the very fashionable UUIDs:
zpool export myPool
zpool import -d /dev/disk/by-uuid myPool
The good thing about UUIDs (and device IDs) is that they aren't optional: a disk always has both. A bad thing about UUIDs and device IDs is that they are far too long and complex to type or remember. That's why I usually partition and assemble disks using device names and then switch to device IDs.
You can switch back to the old device names like this:
zpool export myPool
zpool import -d /dev myPool
You can switch device names anytime. They become part of the data structures on disk the next time you export the pool or shut down.
If you have a lot of drives in many different racks, locating a drive given the device ID isn't easy. ZFS provides a way to assign your own alias names. You specify alias names in this file:
/etc/zfs/vdev_id.conf
In the following example, we have two 4-bay enclosures that contain pools "mrpool" and "mrback."
Run "zpool status" and capture the output in a prototype vdev_id.conf file. Edit the file so it looks like this:
alias r1c1_7TNO ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E2YR7TN0-part1
alias r1c2_8PT2 ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E4FP8PT2-part1
alias r1c3_VNSR ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E7ADVNSR-part1
alias r1c4_FNN3 ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E3FLFNN3-part1
alias r2c1_4CX4 ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E1XS4CXA-part1
alias r2c2_AXC9 ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E1EFAXC9-part1
alias r2c3_3Y4P ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5KJ3Y4P-part1
alias r2c4_3C3J ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5KJ3C3J-part1
In this example, the short name was made by combining location in the box (row/col) with the last four characters of the device id, which is part of the drive serial number printed on the disk label. If a drive is reported as defective, you have the location and a positive id you can physically see.
After creating vdev_id.conf, run:
udevadm trigger
Now you can see your new alias shortcuts in:
/dev/disk/by-vdev
When you confirm that all is well in /dev/disk/by-vdev, you can rename the drives in your pools:
zpool export mrpool
zpool import -d /dev/disk/by-vdev mrpool
Now when you run "zpool status" you'll see the alias names.
These aliases can be configured and used before a pool even exists. You can use the alias names when creating a pool and this is a good way to make sure you're using the correct devices. Always remember that linux device names e.g. "/dev/sdx" can change after rebooting even if you didn't deliberately add or remove anything. It can, for example, depend on the order the drives power up. Or if a device fails to start, all subsequent names in the scan order will change.
By default, the mysterious "ZIL" (ZFS intent log) is created in the storage pool. A separate log device can be specified to improve performance. This can be done when a pool is created:
zpool create fastPool /dev/sdb3 log /dev/myOtherSSD
Or it can be added later:
zpool add fastPool log /dev/myOtherSSD
Log vdevs can be redundant:
zpool add fastPool log mirror /dev/ssd1 /dev/ssd2
You can attach a device to a mirrored log vdev to increase the level:
zpool attach fastPool /dev/ssd2 /dev/sdd3
You can also attach a device to a non-mirrored log vdev to create a mirror.
ZFS always uses the memory-based "ARC" cache. It is possible to add one or more secondary cache vdevs. In ZFS-speak, this is called an L2ARC. These are usually fast devices such as SSDs.
Cache devices can be specified when a pool is created:
zpool create fastPool /dev/sdb3 cache /dev/mySSD
Or they can be added later:
zpool add fastPool cache /dev/mySSD
Cache vdevs cannot be redundant.
Scrubbing verifies all checksums and fixes any files that fail.
zpool scrub myPool
Although the command returns immediately, the scrub itself may take many hours or even days to complete. Use "zpool status" to view the progress. You can stop a scrub using:
zpool scrub -s myPool
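Scrubs are usually run on a schedule. A minimal cron entry might look like this (the pool name and schedule are placeholders; some distributions ship a systemd timer or cron script for this instead):
# /etc/cron.d/zfs-scrub : scrub myPool at 02:00 on the first day of each month
0 2 1 * * root /usr/sbin/zpool scrub myPool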
The zed daemon is part of the zfs installation and runs all the time. There is no explicit systemd enable/disable script.
The configuration file is:
/etc/zfs/zed.d/zed.rc
One essential change is to specify your email address by uncommenting the line:
ZED_EMAIL_ADDR="root"
You can change "root" to your own email address or forward messages to root in /etc/aliases.
To reload the configuration file after editing, you must "HUP" the daemon. First get the process id number for zed:
ps ax | grep zed
Then send the HUP signal:
kill -HUP <pid>
Zed can do lots of nifty things like replacing defective drives with hot spares automatically. Caution: It might be better to take care of backups before you swap in a spare and start resilvering.
Mirror and raidz vdevs can be repaired when a device fails:
As soon as the new device comes online, it will start resilvering - the data will be restored using other drives in the mirror or raidz. You can inspect the progress of resilvering using zpool status.
zpool offline myPool aDrive
zpool offline -t myPool aDrive
zpool replace myPool oldDrive newDrive
The newDrive can be a hot spare or any unused drive.
If newDrive is not specified, ZFS assumes the replacement is at the same device path as oldDrive: this is the case when oldDrive is part of a mirror or raidz pool and has been physically replaced by a new drive at the same path.
ZFS can "tell" if a drive was part of a zfs pool. It will object if you try to bring such a drive online in a new role. To force the issue, add the "-f" option to the replace command.
zpool online myPool aDrive
If the pool property autoreplace is "on" and spare drive is part of the pool, the defective drive will be replaced by the spare and resilvering will start automatically. Using this option seems attractive, but frequently it is better to attend to your backups before starting a lengthy resilvering process.
I recently replaced a device in a raidz2 pool. Here are the details.
The pool device names are from /dev/disk/by-id. The bad disk (which still works, but simply has too many errors to make me comfortable) shows up at:
/dev/disk/by-id/ata-WDC...3C3J-part1
Currently, this is at device name:
/dev/sde
First, I preserve the partition structure:
sfdisk -d /dev/sde > oldmap.dat
The oldmap.dat file is plain text. Edit the file and note the line in the header that assigns a value to "label-id". Also note the expression that assigns a value to "uuid" in the last line of the file. Here is a sample:
label: gpt
label-id: B303994D-9CCE-4732-AA3F-B3755BFD2E69
device: /dev/sdh
unit: sectors
first-lba: 34
last-lba: 7814037134
/dev/sdh1 : start= 2048, size= 7759462400, \
type=516E7CBA-6ECF-11D6-8FF8-00022D09712B, \
uuid=B973915D-7169-4377-ACD2-58285E608D73, \
name="FreeBSD ZFS"
In this example, I added the line continuation marks for illustration only. Now remove the entire line that assigns label-id and cut out the assignment expression for uuid so the file now looks like this:
label: gpt
device: /dev/sdh
unit: sectors
first-lba: 34
last-lba: 7814037134
/dev/sdh1 : start= 2048, size= 7759462400, \
type=516E7CBA-6ECF-11D6-8FF8-00022D09712B, \
name="FreeBSD ZFS"
As noted before, DO NOT use line continuations "\" in the real file.
If your device is too clobbered to read the partition map, you could dump the map from another device in the pool if they were all partitioned the same way.
Take the device offline with respect to zfs:
zpool offline mrPool /dev/disk/by-id/ata-WDC...3C3J-part1
Now shutdown the computer, remove the device, and put in the new one. Before you slide in the new device, note the last four letters of the serial number. In this case, they were:
0545
Boot up the computer and list the device id mapping to confirm that the new drive is at /dev/sde:
ls -l /dev/disk/by-id | grep 0545
(It was.)
Now partition the new device like the old one using your edited map file:
sfdisk /dev/sde < oldmap.dat
The sfdisk utility will generate a new label-id and uuid.
List the device id again so you can copy the partition id to the clipboard to avoid typing it later:
ls -l /dev/disk/by-id | grep 0545 | grep part
Tell zfs to replace the old device with the new one:
zpool replace mrPool \
/dev/disk/by-id/ata-WDC...3C3J-part1 \
/dev/disk/by-id/ata-WDC...0545-part1
Confirm that resilvering has started:
zpool status mrPool
The status shows:
state: DEGRADED
status: One or more devices is currently being resilvered.
The pool will continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Mon Aug 27 09:11:41 2018
57.5G scanned out of 21.1T at 207M/s, 29h41m to go
6.98G resilvered, 0.27% done
...
replacing-7 DEGRADED 0 0 0
/dev/disk/by-id/...3C3J-part1 OFFLINE 0 0
/dev/disk/by-id/...0545-part1 ONLINE 0 0 0 (resilvering)
As we stated earlier, ZFS will operate a lot faster if it knows the physical sector size of your disk(s). If you're building a big storage array, it's worth spending a little time to get this right.
Unfortunately, ZFS cannot reliably detect the sector size. To make matters worse, many disks lie about their size. The history of this peculiar behavior is beyond the scope of this article.
If you take the trouble to find your disk on the manufacturer's web site, they will often reveal the physical sector size. Otherwise, they may state that the disk has the Advanced Format, which is marketing-speak for 4096 byte sectors. It might even be printed on the box.
If that's too much trouble, here's a heuristic that's likely to work. First, ask the disk what it reports. You can use fdisk:
fdisk /dev/sdx
Or the more impressive:
lsblk -o NAME,PHY-SEC
You'll get back either 512 or 4096. Now use this heuristic:
IF the reported sector size is 4096 THEN
The true sector size is 4096.
ELSE IF the disk was made before 2010 THEN
The true sector size is 512.
ELSE IF the disk was made after 2011 THEN
The true sector size is probably 4096.
ELSE
Do you really want to use this old P.O.S.?
END IF
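Here's that heuristic as a small shell sketch (the device path is a placeholder, and the pre-2010 judgment call is left to you):
dev=/dev/sdx
psec=$(lsblk -dno PHY-SEC "$dev")   # physical sector size as reported by the disk
if [ "$psec" -ge 4096 ]; then
    echo "$dev reports $psec-byte sectors: use -o ashift=12"
else
    echo "$dev reports 512-byte sectors: probably still ashift=12 unless the disk predates 2010"
fi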
The option to specify a sector size of 4096 is:
-o ashift=12
The "12" here is the power of 2 that makes 4096. We're all geeks in here you see.
Example: Creating a pool with specified sector size:
zpool create -o ashift=12 mrnas sdb1 sdc1 sdd1 sde1
This optimization is only effective if all the disks have the same sector size.
The zfs command is used to create and configure datasets.
A dataset path always begins with a pool name:
zfs create myPool/myDataset
Any hierarchy of datasets can be created:
zfs create myPool/myDataset/mySubset1/mySubset2
Each dataset to the right is called a descendant of the datasets to the left.
If the pool was created with the default mountpoint, datasets are automatically mounted under the pool with the same path used to create them.
Alternatively, you can specify a mount point when a dataset is created using a property:
zfs create -o mountpoint=aPath myPool/myDataset
In this expression, aPath is any filesystem path or one of the keywords none or legacy. If none is specified, the dataset will not be mounted. If legacy is specified, you can mount the dataset using the regular mount command:
mount -t zfs myPool/myDataset /mnt/here
Or in /etc/fstab:
myPool/myDataset /mnt/myStuff zfs defaults 0 0
Or in /etc/auto.mount:
myStuff -fstype=zfs :myPool/myDataset
You can change the mountpoint of a dataset anytime:
zfs set mountpoint=aPath myPool/myDataset
The old mountpoint will be removed and a new one created.
To restore the default mountpoint behavior, first give the pool a mountpoint at the root:
zfs set mountpoint=/myPool myPool
Then change the mountpoints of each dataset you've created to be inherited:
zfs inherit mountpoint myPool/myDataset1
...
zfs inherit mountpoint myPool/myDatasetN
When switching back to inherited mountpoints, I found it necessary to delete the old mountpoint directories by hand. Perhaps this is a bug in ZFS for Linux?
You can use datasets for non-zfs filesystems. These are called "volumes."
This will create a blank 32G volume:
zfs create -V 32G myPool/myVolume
A new device entry will be created automatically at:
/dev/zd0
The next volume you create will be associated with:
/dev/zd1
And so on each time you add a new volume.
You can see the association between zvol devices and datasets here:
/dev/zvol/myPool/myVolume -> ../../zd0
You can now (optionally) partition the device:
fdisk /dev/zvol/myPool/myVolume
Create a filesystem:
mkfs -t ext4 /dev/zvol/myPool/myVolume
And mount the volume:
mount /dev/zvol/myPool/myVolume /mnt/here
In these examples, /dev/zd0 could be used in place of the longer symlink path, but it's easy to lose track of these associations if you have many volumes. It's safer to use the symbolic links under /dev/zvol/...
zfs rename oldPath newPath
Mark a dataset for destruction:
zfs destroy myPool/myDataset
The actual "destruction" is deferred until all descendant datasets, snapshots, and clones are deleted. If you want everything gone immediately:
zfs destroy -R myPool/myDataset
zpool list
Using compression will make zfs faster unless the dataset contains mostly compressed data (such as media.)
Enable default compression scheme:
zfs set compression=on myPool/myDataset
Enable a better compression scheme:
zfs set compression=lz4 myPool/myDataset
Turning on compression only takes effect for files added later.
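To see how much space compression is actually saving, check the compressratio property:
zfs get compressratio myPool/myDataset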
Although turned on by default, it should usually be turned off:
zfs set atime=off myPool/myDataset
If you don't turn off access time recording, your incremental backups will include every file you've accessed even if it hasn't changed.
IMPORTANT: NFS sharing won't work unless the dataset is mounted locally.
To specify NFS sharing and immediately enable remote access:
zfs set sharenfs=on myPool/myDataset
Linux clients can mount the share from the command line using:
mount -t nfs zfs.host.com:/myDataset /mnt/here
Using /etc/auto.mount the specification is:
myLocalName -fstype=nfs zfs.host.com:/myDataset
Note: The pool name is not part of the nfs path, so it is impossible to share multiple datasets with the same name from different pools on the same host.
Note: It is said to be "auspicious" to avoid sharing zfs datasets using native "/etc/exports" and the sharenfs property: Use one or the other.
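The sharenfs property also accepts NFS export options in place of a plain "on". For example, to restrict read-write access to one subnet (the subnet shown is a placeholder):
zfs set sharenfs="rw=@192.168.1.0/24" myPool/myDataset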
IMPORTANT: Samba sharing won't work unless the dataset is mounted locally.
The samba server requires some preparation. Add these directives to the smb.conf global section:
[global]
...
usershare path = /var/lib/samba/usershares
usershare max shares = 100
usershare allow guests = yes
usershare owner only = no
The usershares directory must be created by hand:
cd /var/lib/samba
mkdir usershares
chmod o+t usershares
chmod g+w usershares
Make sure that the mountpoint directory for the dataset is owned by the samba guest account user. If you want more restricted access, the procedure should be obvious.
When the configuration is complete, restart:
systemctl restart smb
zfs share -a
To specify samba sharing and immediately enable remote access:
zfs set sharesmb=on myPool/myDataset
After you execute that command, a new share description file should appear in /var/lib/samba/usershares. If you don't see the new file there, make sure that samba is running and make sure the dataset has a mountpoint property that is a valid file system path. You can't share a dataset with a legacy mountpoint.
Windows clients will see this at the path:
\\zfs.host.com\mypool_mydataset
Notice that the client sees the path components all in lowercase.
To check that samba is working, you can list the shares:
net usershare list
NOTE: It is said to be "auspicious" to avoid sharing zfs datasets using native "smb.conf" and also using "set sharesmb": Use one or the other.
After an NFS or Samba share is enabled, you can disable or enable the sharing for a specified dataset using:
zfs share myPool/myDataset
zfs unshare myPool/myDataset
These operations take effect immediately. But unsharing a dataset will only be effective until the next reboot. At boot time, the startup scripts will mount all configured shares using:
zfs share -a
At shutdown, some other script will run:
zfs unshare -a
To permanently disable a share, turn off one or both share properties:
zfs set sharenfs=off myPool/myDataset
zfs set sharesmb=off myPool/myDataset
Note: Turning off sharenfs will immediately disable remote access, but turning off sharesmb will not. You must explicitly unshare a previously active samba share.
We have seen a few dataset properties in the previous sections. Properties are set using the expression:
zfs set propertyName=propertyValue myPool/myDataset
Here are a few common properties and values:
mountpoint aPath
compression off | on | lz4
atime on | off
readonly off | on
exec off | on
sharenfs off | on
sharesmb off | on
quota size
reservation size
In these expressions, size can be specified with the M, G, or T suffix. When the option is "off" or "on", the default value is shown first.
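For example, to cap one dataset at 100G and guarantee another 10G of pool space (the dataset names and sizes are arbitrary):
zfs set quota=100G myPool/myDataset
zfs set reservation=10G myPool/myOtherDataset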
Descendant datasets inherit all properties from their parents except the quota and reservation properties.
If you explicitly set a property and later decide it would be better to inherit the value, you can issue the command:
zfs inherit propertyName myPool/myDataset
By adding the "-r" option, all descendants of the specified dataset will inherit the property.
zfs get propertyName myPool/myDataset
zfs get all myPool/myDataset
When you get a property value (as described in the previous section) the listing will show the source of the property, which can be:
default Never specified. The default value is used.
inherited from x Inherited value from parent dataset x.
local The user previously used "set" to specify a value.
temporary The property was specified in a (temporary) -o option when mounting.
received The property was set when the dataset was created by "zfs receive"
(none) The property is read-only.
You can list only properties that have specific source:
zfs get -s local all myPool/myDataset
And you can add -r to list the properties of all children:
zfs get -r all myPool/myDataset
The -s and -r options can be combined.
A property can have both a received and a local (in effect) value.
zfs get -o all myProperty myPool/myDataset
This will show the local and received property values.
Some properties are automatically inherited from parent datasets. Others are not. After changing the value of an otherwise-inherited property, you can restore the inherited value using:
zfs inherit propertyName myPool/myDataset
This will replace any previously set value with the value specified somewhere on the parent path.
A snapshot of a dataset behaves like a backup: it contains the dataset frozen in time when the snapshot was created.
Note the special syntax:
zfs snapshot myPool/myDataset@mySnap1
The process of making a snapshot is nearly instantaneous.
All snapshots of a dataset are accessible in a hidden directory below the original dataset mountpoint:
.zfs/snapshot
Inside that directory, you would find, for example, mySnap1 which was created in the previous example.
You can use "zfs rename" on snapshots, but their parent path cannot be changed.
To see snapshots, add an option:
zfs list -t snapshot
Or use:
zfs list -t all
If you need to look for a deleted file in an old snapshot, remember that each dataset has its own hidden .zfs directory. A parent dataset's .zfs/snapshot may show child mountpoint directories, but they will be empty. Don't panic. You're not looking in the right place.
You can only rename the snapshot, not the dataset path before the "@" symbol:
zfs rename myPool/myDataset@oldName myPool/myDataset@newName
Recursively rename a snapshot and the snapshots of all descendants:
zfs rename -r myPool/myDataset@oldName myPool/myDataset@newName
By default, descendant datasets are not part of a snapshot. (Not to be confused with descendant directories in the dataset filesystem, which will be included.) To include descendant datasets, use the -r option. For example,
zfs create myPool/myDataset1
zfs create myPool/myDataset1/myDataset2
This will not include myDataset2:
zfs snapshot myPool/myDataset1@mySnap1
But this will include myDataset2:
zfs snapshot -r myPool/myDataset1@mySnap1
A rollback transforms a dataset back into the state it was in when a snapshot was created:
First make sure the snapshot itself isn't mounted, then:
zfs rollback -rf myPool/myDataset@mySnap
The -r option destroys all snapshots more recent than the one you are rolling back to. Without this option, you can only roll back to the most recent snapshot. Using -R deletes all dependent clones as well. The -f option unmounts the filesystem if necessary.
A clone works like a copy of a snapshot, but you can modify the files inside:
zfs clone myPool/mySub1@gleep myPool/myFirstClone
Note: Clones must be destroyed before the parent snapshot and dataset can be destroyed.
You can promote a clone so it becomes a normal dataset, independent of the original snapshot.
zfs promote myPool/myClone
You might want to rename it to reflect its elevated status:
zfs rename myPool/myClone myPool/myNewDataset
The syntax is the same:
zfs snapshot myPool/myVolume@mySnap
A new device entry is automatically created:
/dev/zvol/myPool/myVolume@mySnap
This can be mounted in the usual way. Read-only status is implicit.
Replication is used to copy a source dataset from one pool to a destination dataset in another pool. The destination dataset may be part of a pool on another host. The most common use of replication is making backups.
First time backup:
zfs snapshot myPool/myDataset@now
zfs send myPool/myDataset@now | zfs receive -d myBack
Later incremental backups:
zfs rename myPool/myDataset@now myPool/myDataset@then
zfs snapshot myPool/myDataset@now
zfs send -i myPool/myDataset@then myPool/myDataset@now | zfs receive -dF myBack
zfs destroy myPool/myDataset@then
zfs send options:
-i : Incremental (the old and new snapshots are the following parameters)
-I : Same as -i, but all intermediate snapshots between the two are also included.
zfs receive options:
-d : Discard the pool name from the sending path
-F : Force rollback to most recent snapshot.
In the example above, using -d produces this result on the receiving side:
myBack/myDataset@now
Without -d, you would get:
myBack/myPool/myDataset@now
The -F option does a rollback on the receiving dataset, discarding any changes since the last snapshot. This seems alarming, but the practice is necessary because simply browsing a mounted backup will alter access times (if enabled) and zfs will treat the data as modified, forcing things to be copied. If the destination dataset is used only for backups, there shouldn't be any useful changes since the last snapshot.
An alternative to using -F is to make the destination dataset readonly:
zfs set readonly=on myBack/myDataset
If myBack is used exclusively for backups, you can make the whole pool readonly:
zfs set readonly=on myBack
The readonly property will be inherited by all new descendant datasets created under myBack. It only applies to normal filesystem operations, not to the receive command, which will modify the destination dataset to match the source.
By adding a few options, you can backup the entire pool including all descendant datasets.
First time backup:
zfs snapshot -r myPool@now
zfs send -R myPool@now | zfs receive -ud myBack
Later incremental backups:
zfs rename -r myPool@now myPool@then
zfs snapshot -r myPool@now
zfs send -Ri myPool@then myPool@now | zfs receive -uFd myBack
zfs destroy -r myPool@then
zfs rename options:
-r : Recursively rename the snapshots of all descendant datasets.
(Only snapshots can be recursively renamed!)
zfs snapshot options:
-r : Recursively create snapshots of all descendants
zfs send options:
-R : Create a replication package (gets all descendants including snapshots and clones)
-i : Incremental (old and new snapshots are the following parameters)
When doing an incremental send, the "old" snapshot must be the one previously sent.
zfs receive options:
-u : Don't mount anything created on the receiving side.
-d : Discard the pool name from the sending path
When used with -R on the sending side, -F will delete everything that doesn't exist on the sending side. This makes the readonly option unnecessary.
The -u option is useful when the datasets being created have non-default mountpoint options. It prevents them from being mounted and possibly overriding existing directories or mountpoints in the receiving filesystem. Unfortunately, the systemd startup scripts will attempt to mount them the next time you reboot. Please see Avoiding mountpoint conflicts.
You must have ssh configured and working first. In this example, myBack is assumed to exist on the destination host with ip name "destHost". For a non-incremental backup:
zfs send myPool@now | ssh destHost zfs receive -d myBack
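Subsequent incremental backups over ssh follow the same whole-pool pattern shown earlier:
zfs send -Ri myPool@then myPool@now | ssh destHost zfs receive -uFd myBack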
By default, when a pool receives a dataset, whatever property values existed on the sending side will be preserved. However it is possible to override this behavior by setting properties in the zfs receive command or by setting them locally after the dataset is received. In this case, the property will have both a local and a received value. If you subsequently clear the local value, the received value will take effect again. See the "zfs get" section above for a listing of possible property sources. In the next section, we cover the most common case for modifying a received property.
When several hosts use zfs replication to backup file systems, there is a potential for mountpoint conflicts on the backup host. If root file systems are backed up, they will almost certainly conflict with mountpoints active on the backup host.
We would usually prefer that none of the mountpoints in the backup dataset be active on the backup host. But if it should ever be necessary to restore a backup, we want the original mountpoints to be remembered and restored. ZFS provides special options for send and receive to get exactly this effect.
In the following example, the dataset "srcPool/srcDataset" will be sent to "dstMachine", which has a pool named "dstPool".
First, on the dstMachine, you must create a parent dataset. It can have the name "srcDataset" to help you remember where it came from:
(On dstMachine)
zfs create dstPool/srcDataset
Now on srcMachine execute:
zfs snapshot -r srcPool/srcDataset@now
zfs send -R srcPool/srcDataset@now | ssh dstMachine \
zfs receive -ud -x mountpoint dstPool/srcDataset
This will replicate srcPool/srcDataset to dstPool/srcDataset so they match except for mountpoints: They will be "latent" but appear to be "none" because that's what they inherit from the newly-created srcDataset on dstMachine.
The "-x mountpoint" options means "don't change the mountpoints on the target system, but remember the mountpoint property value so it will be set if the dataset is ever replicated without the "-x".
Sound a bit complex, but it's exactly what you want when doing backups from many machines to a common backup server pool.
Later incremental backups:
zfs rename -r srcPool/srcDataset@now srcPool/srcDataset@then
zfs snapshot -r srcPool/srcDataset@now
zfs send -Ri srcPool/srcDataset@then srcPool/srcDataset@now \
| ssh dstMachine zfs receive -ud -x mountpoint dstPool/srcDataset
zfs destroy -r srcPool/srcDataset@then
New zfs receive option:
-x mountpoint : Remember but don't modify the mountpoint
The effect of -x is to make sure we don't lose the mountpoint value, but keep the local value (previously set using the -o option.) I like to call these "latent" mountpoints. Perhaps there's a more official-sounding zfs term?
zfs get -o all mountpoint myBack/myDataset
This command will show the value of mountpoint locally in effect and also the original received value.
When restoring a backup to the original host, use the "-b" option for "zfs send":
zfs send -R -b myBack/myDataset@now | zfs receive -ud myPool
New zfs send options:
-b : Send previously received property values instead of the local values.
In our example, this will restore myDataset and all the descendant mountpoints to their original state.
In the examples shown above, we suppress mountpoints on the backup device using (-x mountpoint). If you want to access files in one of these snapshots, you'll have to mount the dataset. But that causes a big problem: changing the mountpoint from "none" to some local directory modifies the dataset, so the next incremental receive from the source will fail. You'll see a message like this:
cannot receive: destination has been modified since most recent snapshot
The right way to access files in a backup snapshot is by using a temporary clone. Here is an example of accessing the root (user) directory in a backup:
Create a clone:
zfs clone mrPool/server/root@now mrPool/gleeb
Give it a local mountpoint:
zfs set mountpoint=/gleeb mrPool/gleeb
Now recover your files from /gleeb.
Destroy the clone:
zfs destroy mrPool/gleeb
Get rid of the mountpoint directory:
rmdir /gleeb
There are other ways to deal with this issue:
1) Using the -F option with "zfs receive": The "-F" will revert all changes (deleting all other snapshots), keeping only the most recent snapshot. The drawback to using -F is the part about destroying "all other snapshots": This can be a problem if you perform secondary backups (replication of a replication) because you'll need to preserve the snapshots associated with each incremental stage.
2) Access snapshots through the invisible ".zfs" directory. This is a fine idea if the backup dataset has an active local mountpoint. But when backing up many computers to a common backup pool, having active mountpoints isn't a practical idea.
These commands are useful if you build and install zfs from source or if the modules fail to load during the boot process. Normally, zfs starts using the "zfs.target" under control of systemd. You can do this yourself:
systemctl start zfs.target
Here's what starting the zfs.target will do:
# Load zfs and dependencies:
modprobe zfs
# Import all pools:
zpool import -c /etc/zfs/zpool.cache -aN
# Mount all file systems:
zfs mount -a
# Activate smb and nfs shares:
rm -f /etc/dfs/sharetab
zfs share -a
# Start the zed daemon:
zed -F
Somehow, systemd makes the last command above daemonize zed, but to start it from the command line, I had to use systemctl:
systemctl start zed.target
When you update the linux kernel, dkms is supposed to update the zfs modules. Occasionally, this mechanism fails and your zfs filesystems will be missing after a reboot. To fix this problem, you can force a rebuild using dkms commands, which is somewhat tedious, or you can simply re-install the packages:
yum reinstall spl-dkms
yum reinstall zfs-dkms
Now bring up zfs without rebooting:
systemctl start zfs.target
After doing a major upgrade, e.g. version 22 to 23, ZFS will not start or automatically rebuild the modules even if the associated dkms packages are re-installed. I haven't figured out why this happens, but the fix is easy: Just add and install the modules by hand:
First, make sure your modules are up to date:
dnf update -y
Find out what version you have:
rpm -q spl-dkms
rpm -q zfs-dkms
Usually spl and zfs will have the same version numbers. Ignore the extension part of the version string: For example, if "rpm -q" shows something like:
zfs-dkms-0.6.5.4-1.fc23.noarch
Your version string is just:
0.6.5.4
Now you can add and install the modules:
dkms add -m spl -v 0.6.5.4
dkms add -m zfs -v 0.6.5.4
dkms install -m spl -v 0.6.5.4
dkms install -m zfs -v 0.6.5.4
And bring up zfs:
systemctl start zfs.target
After your package manager updates zfs, you may find new features. To avoid problems where there may be multiple hosts that share pools, this process is controlled by a features protocol. You can decide which new pool features to enable or ignore, even if you accept the rest of a software update. This happened to me recently, so I'll outline the process with a specific example:
After doing a routine "yum update", I got this message from "zpool status":
status: Some supported features are not enabled on the pool. The pool
can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not
support the features. See zpool-features(5) for details.
Ok. Next I tried running:
zpool upgrade
In older versions of zfs, the pool data structure had a version number which would be incremented by the upgrade process. It was an "all or nothing" deal. But that was yesterday. Running "zpool upgrade" now reports:
This system supports ZFS pool feature flags.
All pools are formatted using feature flags.
Some supported features are not enabled on the following pools. Once a
feature is enabled the pool may become incompatible with software
that does not support the feature. See zpool-features(5) for details.
POOL FEATURE
---------------
mrBack
filesystem_limits
large_blocks
mrPool
filesystem_limits
large_blocks
It turns out you have to enable each feature on each of your pools by hand:
zpool set feature@filesystem_limits=enabled mrBack
zpool set feature@large_blocks=enabled mrBack
zpool set feature@filesystem_limits=enabled mrPool
zpool set feature@large_blocks=enabled mrPool
The possible values for a feature are disabled, enabled, and active. (A feature becomes active once it is enabled and its on-disk format change is actually in use.)
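To see where each feature currently stands on a pool, you can list the feature properties, for example:
zpool get all mrPool | grep feature@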
This is clearly a better idea: You can decide how to trade off the value of a new feature against potential compatibility problems if you ever want to import your pools on another host.